Main Algorithm
- Transform the network into a data matrix (three feature types)
- The prior: a single feature equal to the logit-transformed prior probability of treatment. The prior captures the probability that a compound treats a disease based only on their treatment degrees.
- Degree features: one per metaedge that connects either a compound or a disease (16 degree features: 8 compound degrees, 8 disease degrees). Degree features are IHS (inverse hyperbolic sine) transformed.
- DWPCs: make up the vast majority of features. Each DWPC corresponds to a metapath (1,206 metapaths of length 2 to 4 edges).
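The two non-DWPC feature types above can be sketched as follows; `logit` and `ihs` are helper names introduced here for illustration, not from the project code:

```r
# Hedged sketch of the prior and degree features (toy values).
logit <- function(p) log(p / (1 - p))        # prior probability -> logit scale
ihs   <- function(x) log(x + sqrt(x^2 + 1))  # inverse hyperbolic sine transform

prior_feature  <- logit(c(0.01, 0.2, 0.5))   # toy prior treatment probabilities
degree_feature <- ihs(c(0, 3, 50))           # IHS is defined at 0, unlike log()
```

The IHS transform behaves like a log for large values but handles the zero degrees that a plain log cannot.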
- Select observations or select features. Stage 1: compute all features for a subset of observations; stage 2: compute the selected features for all observations.
- Why?: computing every DWPC for every pair (209,168 observations × 1,206 DWPCs ≈ 252 million DWPC queries) is impractical
- all-features matrix: includes all 755 positives and 4 × 755 random negatives (≈4.6 million DWPCs in total, taking 4 days and 11 hours to compute)
- all-observations matrix: use the all-features matrix to choose which features to compute for every observation
- Select features by calculating each feature's AUROC (Daniel's approach)
- LASSO
- Combine both methods
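The first selection method above (ranking features by their individual AUROC) can be sketched in base R; `X` and `y` below are toy stand-ins for the all-features matrix and the treatment labels:

```r
# Rank-based AUROC of one feature against binary labels (0/1).
feature_auroc <- function(x, y) {
  r  <- rank(x)                                  # average ranks handle ties
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
y <- rep(c(1, 0), each = 50)                     # toy labels
X <- cbind(good = y + rnorm(100), noise = rnorm(100))
aurocs <- apply(X, 2, feature_auroc, y = y)
keep   <- names(aurocs)[abs(aurocs - 0.5) > 0.1] # drop uninformative features
```

Features whose AUROC stays near 0.5 carry little signal on their own and can be dropped before the expensive all-observations queries.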
- Transform the matrix: transform the DWPCs
- why?:
- A lot of 0s in the matrix
- Lots of bias
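One way to address the zero inflation and skew above is an IHS transform followed by standardization; this is a hedged sketch, and the exact transform used in the project may differ:

```r
# IHS behaves like log() but is defined at 0, which suits
# zero-inflated DWPC columns; then standardize each column.
ihs <- function(x) log(x + sqrt(x^2 + 1))
transform_dwpc <- function(x) {
  z <- ihs(x)
  (z - mean(z)) / sd(z)
}

dwpc   <- c(0, 0, 0, 0.2, 1.5, 4.0)   # toy column: many zeros, long tail
scaled <- transform_dwpc(dwpc)
```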
- Fit model (my focus)
- Visualization of whole data matrix
# to see the dimension of all features matrix
dim(tran_df)
## [1] 3775 2070
3,775 observations and 2,070 features
Graph: the pattern of 0 values in the matrix
############ correlations between columns
library(dplyr)     # select()
library(caret)     # findCorrelation()
library(corrplot)  # corrplot()

par(mfrow = c(1, 1))
#####
# correlation among the degree columns
p0 <- data.frame(select(data.frame(X_dwpc), starts_with('degree')))
c0 <- cor(p0, use = "complete")
# flag columns with pairwise correlation above the 0.80 cutoff
rem0 <- findCorrelation(c0, cutoff = .80)
p0.rem <- cor(p0[, rem0], use = "complete")
p0.plot <- corrplot(p0.rem, method = 'ellipse', type = "lower",
                    order = "FPC", tl.col = "black")
Graph: correlation plot for the highly correlated (correlation ≥ 0.80) degree feature columns
# correlation among the dwpc columns
p1 <- data.frame(select(data.frame(X_dwpc), starts_with('dwpc')))
c1 <- cor(p1, use = "complete")
# flag columns with pairwise correlation above the 0.95 cutoff
rem1 <- findCorrelation(c1, cutoff = .95)
# since many features are highly correlated (length(rem1) = 586),
# loop over them in chunks of j = 30 columns per plot, giving
# ceiling(length(rem1)/j) output graphs in total
j <- 30
for (i in 1:ceiling(length(rem1) / j)) {
  idx <- ((i - 1) * j + 1):min(j * i, length(rem1))
  p1.rem <- cor(p1[, rem1[idx]], use = "complete")
  p1.plot <- corrplot(p1.rem, method = 'ellipse', type = "lower",
                      order = "alphabet", tl.cex = 0.6)
}

All output graphs: the high-correlation (correlation ≥ 0.95) plots for the DWPC feature columns
- Fit model
- lasso logistic regression
- Daniel did not evaluate his model on held-out test data, so his reported AUROC is overly optimistic (overfit)
- Random Forest
- Gradient Boosting
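To avoid the overfitting issue noted above, each model should be scored on held-out data. A minimal base-R sketch, with plain logistic regression standing in for the lasso fit and toy `X`, `y` placeholders for the real matrix and labels:

```r
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(X$x1 - X$x2))           # toy labels

train <- sample(n, 0.7 * n)                      # 70/30 train/test split
fit   <- glm(y[train] ~ ., data = X[train, ], family = binomial)
pred  <- predict(fit, newdata = X[-train, ], type = "response")
# score AUROC / log loss on pred vs y[-train], never on the training set
```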
- Result
- Logloss
- If the true y = 1 and the predicted probability is close to 1, the log loss is small; likewise, if the true y = 0 and the prediction is close to 0, the log loss is small.
Log loss formula: logloss = −(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]
##
## -----------------------------------------------------
##  Lasso.logistic   Random.Forest   Gradient.Boosting
## ---------------- --------------- -------------------
##      0.3321          0.3264            0.2795
## -----------------------------------------------------
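The log-loss values above come from the binary cross-entropy; a small helper makes the definition concrete (a hedged sketch, `logloss` is a name introduced here):

```r
# Binary log loss: small when confident predictions are correct.
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to keep log() finite
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(c(1, 0), c(0.9, 0.1))        # confident and correct -> small loss
```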
# plot ROC curves
plot(glmnetPlotRoc, col = "red", lty = 1, lwd = 2, main = "ROC curves")
plot(randomForestPlotRoc, col = "blue", lty = 1, lwd = 2, add = TRUE)
plot(xgboostPlotRoc, col = "green", lty = 1, lwd = 2, add = TRUE)
legend(0.56, 0.2, legend = c("logisticRegression", "randomforest", "gradientBoosting"),
       lty = 1, lwd = 2, col = c("red", "blue", "green"), cex = 1, bty = "n")
legend(0.3, 0.6,
       paste0(c("AUROC for logistic = ", "AUROC for randomforest = ",
                "AUROC for gradientboosting = "),
              c(round(glmnetAuroc, digits = 2), round(randomForestAuroc, digits = 2),
                round(xgboostAuroc, digits = 2))),
       border = "white", cex = 0.8, box.col = "white")

# plot PRC curves
plot(glmnetPlotPrc, col = "red", lty = 1, lwd = 2, main = "PRC curves")
plot(randomForestPlotPrc, col = "blue", lty = 1, lwd = 2, add = TRUE)
plot(xgboostPlotPrc, col = "green", lty = 1, lwd = 2, add = TRUE)
legend(0.1, 0.2, legend = c("logisticRegression", "randomforest", "gradientBoosting"),
       lty = 1, lwd = 2, col = c("red", "blue", "green"), cex = 1, bty = "n")
legend(0.1, 0.45,
       paste0(c("AUPRC for logistic = ", "AUPRC for randomforest = ",
                "AUPRC for gradientboosting = "),
              c(round(glmnetAuprc, digits = 2), round(randomForestAuprc, digits = 2),
                round(xgboostAuprc, digits = 2))),
       border = "white", cex = 0.8, box.col = "white")

- Future Directions
- Expand the feature set
- Deal with the 0s in the matrix
- Deal with correlated columns in the matrix
- Improve the gradient boosting model